See quick reference at the bottom
See full module reference section for full details

At the beginning of each analysis, the first step is to load ReproPhylo and its dependencies with the command:


In [1]:
from reprophylo import *

Once this is done we can start a Project. A Project contains all the data, metadata, methods and environment information, and it is the unit that is saved as a pickle file, which is version controlled with Git.

Although ReproPhylo is designed to record versions and update the pickle file automatically, we will opt out of this behaviour in this tutorial and introduce it after we have covered the basics.

Instead, we will manually save a pickle file at the end of each section, and will load it in the next one. You should use the same pickle file name at the end of all the sections. The new content will be added to the one already present in the file.
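
The save-then-reload rhythm described above can be illustrated with a minimal standalone sketch. It uses Python's standard pickle module and a plain dictionary rather than a real Project, and save_state and load_state are hypothetical helpers written for this illustration, not ReproPhylo functions. ReproPhylo's own pickle format and saving function may work differently.

```python
import os
import pickle
import tempfile

def save_state(state, path):
    """Write the whole state object to path, replacing any older file."""
    with open(path, 'wb') as handle:
        pickle.dump(state, handle)
    return path

def load_state(path):
    """Read back the object stored by save_state."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)

# A throwaway location, so the sketch does not touch real tutorial files.
path = os.path.join(tempfile.mkdtemp(), 'my_project.pkl')

# End of one section: save what exists so far.
save_state({'loci': ['MT-CO1']}, path)

# Start of the next section: load, add new content, save under the same name.
state = load_state(path)
state['loci'].append('18s')
save_state(state, path)
```

Because the same file name is reused, the file always holds the most recent state, which includes everything added in earlier sections.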

If you want to jump ahead, there are presaved pickle files (Tutorial_files/basic/outputs), numbered according to the section after which they were saved. For example, outputs/3.6.alignments.pkpj was saved at the end of section 3.6 and can be loaded at the top of section 3.7, instead of your own file.

To start a Project, we have to specify the loci to analyse (not actual sequence data, only some information on the loci) and a pickle file name.

3.2.1 Describing Loci

A Locus can be described manually using a command or by providing a file. For each Locus, we have to specify the character type (DNA or protein), the feature type (e.g., rRNA, CDS or gene), the name of the locus (e.g., MT-CO1) and any aliases, which may come in handy if we want to read a genbank file (e.g., cox1, coi).

Describing loci using a command


In [2]:
coi = Locus(char_type='dna', 
            feature_type='CDS', 
            name='MT-CO1',
            aliases=['cox1', 'coi'])

This is a single Locus description (a Locus object). We can confirm its content by printing it like this:


In [3]:
print coi


Locus(char_type=dna, feature_type=CDS, name=MT-CO1, aliases=cox1; coi)

Describing loci using a file
Another way of describing loci is to write them in a file. The file has one line for each Locus, where each line has at least four items, separated by commas. The items, as above, are the character type, the feature type, the name of the locus and other possible aliases. At least one alias must be specified, but it can be identical to the name. For the MT-CO1 Locus, a file would look like this:

dna,CDS,MT-CO1,cox1,coi
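
To make the layout of such a line concrete, here is a minimal standalone sketch that splits it into the four kinds of items. It relies only on Python's standard csv module. The function parse_locus_line is a hypothetical helper written for illustration, not part of ReproPhylo, and ReproPhylo's own parser may behave differently.

```python
import csv

def parse_locus_line(line):
    """Split one loci-file line into its items (hypothetical helper)."""
    fields = [f.strip() for f in next(csv.reader([line]))]
    if len(fields) < 4:
        raise ValueError(
            'expected at least four comma-separated items: %r' % line)
    char_type, feature_type, name = fields[:3]
    aliases = fields[3:]          # everything after the name is an alias
    return {'char_type': char_type,
            'feature_type': feature_type,
            'name': name,
            'aliases': aliases}

locus = parse_locus_line('dna,CDS,MT-CO1,cox1,coi')
```

Here locus['name'] is 'MT-CO1' and locus['aliases'] is the list ['cox1', 'coi']. A line with fewer than four items raises a ValueError, mirroring the rule that at least one alias must be given.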

Deducing a loci file from a genbank file

A third way of describing loci is to run a command that guesses them from a genbank file and writes them into a comma-delimited file, as above. This file can be used as is, or it can be edited. The following command will prepare such a loci file from a genbank file containing all the GenBank records belonging to the sponge family Tetillidae. Text starting with a hash (#) is a comment, which does not affect the command:


In [4]:
list_loci_in_genbank('data/Tetillidae.gb', # The input genbank
                                           # file
                     
                     'data/loci.csv',      # The loci file
                     
                     'outputs/loci_counts.txt') # Additional
                                                # output,
                                                # discussed
                                                # below.

The command generated the loci file and wrote it in data/loci.csv. Here are some excerpts separated by three dots:

dna,rRNA,18s,18S ribosomal RNA,18S rRNA
dna,rRNA,28s,28S large subunit ribosomal RNA,28S ribosomal RNA
...
dna,CDS,MT-ATP8,atp8,ATP8
dna,CDS,MT-CO1,coi,COI,cox1,COX1,coxI
...
dna,rRNA,rnl,rnl
dna,rRNA,rns,rns
dna,rRNA,rrnL,rrnL

Each line represents a locus that was found in the genbank file data/Tetillidae.gb. For some genes, such as 18s, synonyms were recognized and placed as aliases in one line. In other cases, such as for rnl and rrnL, they were not.

Editing the loci file
Possible edits to this file include:

  • Synonymization. This is done by adding a comma and a shared integer to all the lines that describe the same locus. For example the lines

       dna,rRNA,rnl,rnl
       dna,rRNA,rrnL,rrnL

       will become

       dna,rRNA,rnl,rnl,9
       dna,rRNA,rrnL,rrnL,9

       Which integer is written is unimportant, as long as it is shared between synonymous lines.

  • Change of character type. If our data includes translations to protein sequence, we can change dna to prot, like so:

       prot,CDS,MT-CO1,coi,COI,cox1,COX1,coxI

       This will tell the program to use protein sequences instead of DNA sequence. The sequence alignment tutorial explains how to use both protein and DNA sequence of the same locus to conduct codon alignment.

  • Deletion of loci. It is possible to delete loci we do not want to analyse. They will not be read, even if they exist in our data.
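
The effect of the shared-integer convention can be sketched in a few lines of standalone Python. This is only an illustration under stated assumptions: merge_synonyms is a hypothetical helper, not a ReproPhylo function, and it assumes that the first tagged line supplies the locus name while the names and aliases of later lines with the same integer become additional aliases. How ReproPhylo itself resolves synonymous lines may differ.

```python
import csv

def merge_synonyms(lines):
    """Merge loci-file lines that share a trailing integer tag.

    Assumption for this sketch: the first tagged line supplies the locus
    name, and the names and aliases of later lines become extra aliases.
    A purely numeric last item is always treated as a tag.
    """
    merged = []          # output rows, in first-seen order
    by_tag = {}          # integer tag -> row already placed in merged
    for fields in csv.reader(lines):
        fields = [f.strip() for f in fields]
        if not fields:
            continue     # skip blank lines
        tag = fields[-1] if fields[-1].isdigit() else None
        if tag is not None:
            fields = fields[:-1]           # drop the tag itself
        if tag is None or tag not in by_tag:
            row = list(fields)
            merged.append(row)
            if tag is not None:
                by_tag[tag] = row
        else:
            row = by_tag[tag]
            for alias in fields[2:]:       # name + aliases of the synonym
                if alias not in row[2:]:
                    row.append(alias)
    return merged

rows = merge_synonyms(['dna,rRNA,rnl,rnl,9',
                       'dna,rRNA,rrnL,rrnL,9',
                       'dna,CDS,MT-CO1,coi,cox1'])
```

With this input, the two rRNA lines collapse into a single row named rnl that also carries rrnL as an alias, while the untagged MT-CO1 line passes through unchanged.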

The second file that the command above produced, outputs/loci_counts.txt, contains a list of the loci found in the genbank file, with the number of their occurrences. This can be used as a guide when deciding which loci to delete and which to keep.

3.2.2 Loading loci to a new Project

Loading Locus objects
First we'll make another Locus object, to show that more than one can be read:


In [5]:
ssu = Locus('dna','rRNA','18S',['ssu','SSU-rRNA'])

Regardless of whether we have one or more Locus objects, they are read as a list, which means that they are wrapped with square brackets and separated by commas:


In [6]:
loci_list = [coi, ssu]

This command will start the Project and will write it to the pickle file outputs/dummy.pkpj:

pj = Project(loci_list, pickle='outputs/dummy.pkpj')

The following alternative will start a Project and will load the loci from a file, data/edited_loci.csv, that looks like this:

dna,rRNA,18s,18S ribosomal RNA,18S rRNA
dna,rRNA,28s,28S large subunit ribosomal RNA
dna,CDS,MT-CO1,coi,COI,cox1,COX1,coxI

In [7]:
pj = Project('data/edited_loci.csv',
             pickle='outputs/my_project.pkpj', git=False)


DEBUG:Cloud:Log file (/home/amir/.picloud/cloud.log) opened

This will provoke a bunch of Git-related messages, which will be discussed in the version control section of this tutorial.
If we print the Project we'll get this message:


In [8]:
print pj


Project object with the loci 18s,28s,MT-CO1,

3.2.3 Modifying the loci of an existing Project

As you have seen, when you start a Project you pass a list of loci or a csv file name with the loci attributes:

pj = Project(loci_list, pickle='filename')

Once the Project exists, it is possible to modify the Locus objects it contains. To add a Locus, you need to create it, as you have done:

lsu = Locus('dna', 'rRNA', '28S', ['28s','LSU-rRNA'])

and then also add it to the Project. Loci are stored in a list called pj.loci. So the new Locus can be appended to it:

pj.loci.append(lsu)

or, if we have a list of new loci to add (here nd5 stands for another Locus object created in the same way), for example:

new_loci_list = [nd5, lsu]

it can be added to the loci list like so:

pj.loci += new_loci_list

Lastly, we can modify loci that are already in pj.loci. For example, change the name and add an alias to the MT-CO1 Locus object:

for l in pj.loci:                # Find the Locus named MT-CO1
    if l.name == 'MT-CO1':
        l.name = 'COI'           # Rename it to COI
        l.aliases.append('coi')  # Add the alias coi
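
Note that the loop above does nothing if no Locus has the expected name, and it appends the alias even if it is already present. The following standalone sketch shows one way to guard against both. SimpleLocus is a hypothetical stand-in class that only mimics the attribute names used in this tutorial, and rename_locus is a hypothetical helper, so neither is part of ReproPhylo.

```python
class SimpleLocus(object):
    """Hypothetical stand-in with the same attribute names as a Locus."""
    def __init__(self, char_type, feature_type, name, aliases):
        self.char_type = char_type
        self.feature_type = feature_type
        self.name = name
        self.aliases = list(aliases)

def rename_locus(loci, old_name, new_name, extra_aliases=()):
    """Rename the first locus called old_name; return it, or None."""
    for locus in loci:
        if locus.name == old_name:
            locus.name = new_name
            for alias in extra_aliases:
                if alias not in locus.aliases:   # avoid duplicate aliases
                    locus.aliases.append(alias)
            return locus
    return None                                  # nothing matched

loci = [SimpleLocus('dna', 'CDS', 'MT-CO1', ['cox1', 'coi']),
        SimpleLocus('dna', 'rRNA', '18S', ['ssu', 'SSU-rRNA'])]
changed = rename_locus(loci, 'MT-CO1', 'COI', ['coi', 'CO1'])
```

Returning None when nothing matches makes a misspelled locus name easy to detect, instead of leaving the list silently unchanged.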

In [11]:
# Update the pickle file
pickle_pj(pj, 'outputs/my_project.pkpj')


Out[11]:
'outputs/my_project.pkpj'

3.2.4 Quick reference


In [ ]:
# A Locus object
coi = Locus(char_type='dna',         # or 'prot'
            feature_type='CDS',      # any string
            name='MT-CO1',           # any string
            aliases=['coi', 'cox1']) # list of strings

# Guess loci.csv file from a genbank file
list_loci_in_genbank('genbank.gb',
                     'loci.csv',
                     'loci_counts.txt')

# Start a Project
# With a Locus object list
pj = Project([coi, ssu], pickle='pickle_filename')

# With a loci.csv file
pj = Project('loci.csv', pickle='pickle_filename')

# Add a Locus to an existing Project
pj.loci.append(coi)
# Or
pj.loci += [coi]

# Modify a Locus existing in a Project
for l in pj.loci:
    if l.name == 'MT-CO1':
        l.name = 'newName'
        l.feature_type = 'newFeatureType'
        l.char_type = 'prot'
        l.aliases.append('newAlias')
        # Or
        l.aliases += ['newAlias1', 'newAlias2']